ABSTRACT

This thesis describes the performance-measurement experiments
designed for a number of backend, relational database machine
configurations. An in-depth study of the tests and results of
two relational operations, namely selection and projection, on a
specific configuration is presented. In addition, tests are made
on the ordering capabilities and performance of the machine
configuration. The goal of the work is to lead to the development
of a machine-independent methodology for benchmarking the
selection and projection operations and the ordering capabilities
of database machines.
A.  BENCHMARKING DATABASE MACHINES

Benchmarks have long been a means for making effective
comparisons of differing hardware configurations and hardware
architectures. As early as 1970 instruction mixes were formed and
tested over varying configurations to provide a means of
comparison between installations. The early works included the
Gibson [Ref. 1] and Flynn [Ref. 2] mixes, which consisted of
machine instructions ordered by instruction class. The Gibson mix
was based on data collected from IBM 7090 installations, while
the Flynn mix used programs run at IBM System/360 installations.
There has been some work done with similar approaches at the
language level, predominantly the work of Knuth [Ref. 3], who
used a mix of Fortran statements to obtain his benchmark
parameters. All these approaches involved the running of some
standardized mix of instructions, either machine instructions or
instructions in some high-level language. They used the
experimental results from these runs to conduct an analysis of
the computer system performance.

1.  A Definition

Benchmarking is a term used throughout the industry in a myriad
of differing contexts. In each case the ultimate goal is to make
an independent measure or relevant comparison of machine
capabilities. These comparisons or measures could be anything
from the throughput to the speed of calculations by a certain
internal component, but in the final analysis some measure or
evaluation of performance is desired.


There are many different ways of evaluating machine performance.
Many manufacturers provide the capability of attaching monitoring
systems to their equipment. These may be either hardware
monitors, which physically sense the action occurring in the
system and keep statistical records, or software monitors, which
attempt to perform the same function with software hooks that
keep track of the system operation and give the operator a
statistical analysis of the machine action and performance.
Software monitors have the disadvantage of using a good deal of
the system time just for their own operation. Though hardware
monitors do not suffer from this disadvantage, they require the
wiring of the monitor system into the hardware. The biggest
disadvantage of these types of measurements, however, is the
inability to make comparisons across differing machine
configurations and between different manufacturers. Benchmarks
attempt to solve this problem by forming some standardized
testing methodology that is easily transportable from one machine
to another. Most importantly, the measurements made must be
relevant regardless of the machines benchmarked and give an
accurate means of comparison between these machines.

Therefore, benchmarks are defined to be certain sets of
instructions that will test all the capabilities of a machine and
yield some generic set of data that will give an accurate measure
of that machine in its tested configuration. This data will then
give the observer specific guidelines for making relevant and
general comparisons with similar machines and configurations.

2.      Database   Machine    Benchmarks 

With the advent of special-purpose database machines and backend
database machines, a new field of application for benchmarks
exists. Previously, benchmark routines have been used exclusively
for the testing and performance evaluation of large
general-purpose mainframes. With the proliferation of backend
processors to unload specialized tasks from the mainframe, these
benchmarks have been ineffective, because the computer system's
capabilities of performing the specialized tasks are not
benchmarked. Our primary concern is with the benchmarking of
specialized backends known as database machines. In this context
we mean a specialized processor externally linked to a mainframe,
with its own special-purpose hardware and software for database
management. Backend refers to this externally-linked and
specially-built machine.

3.  Objectives

At present the backend database machine is in its infancy in the
commercial marketplace. Nevertheless, the database system is
extensively utilized in various forms and for different tasks,
exclusively in some software configuration operating on a large
general-purpose machine. In order to provide effective database
functions the software-laden database system consumes a great
deal of the mainframe's resources, which severely limits the
usefulness of the mainframe for other functions.

This has started a trend towards the backend database machine,
one that can reduce the time the host spends in searching and
updating data in response to user queries. This greatly increases
the ultimate usefulness of the host, since these backend database
machines are only a small fraction of the total system cost. The
database machines now on the market have been implemented using
microprocessor technology rather than fully-specialized hardware,
thereby keeping their costs down. As the market expands and more
progress is made in VLSI technology, we can expect to see more
specialized hardware at even lower cost.
Our objective here is to develop some basic testing procedures to
benchmark relational database machines. This thesis also gives an
account of the results of tests performed on a specific backend
database machine, the RDM-1100, and its various configurations.
It is limited to the results of test queries in the operations of
selection and projection and to ordering capabilities. In
addition to this thesis, there are three other theses,
[Refs. 4,5,6], which describe in detail the test procedures and
results of join operations, the generation of the databases used
in the experiments, and the other test procedures and results.
The ultimate goal of the entire project is to develop and
identify some sets of queries that can be used in evaluating
database machine performance.

B.       THE    BENCHMARKING    ENVIRONMENT 

Our primary emphasis is to evaluate the performance of the
system/machine under typical operating conditions. In this sense
a standardized workload model must be developed. This includes
the use of typical user queries (transactions) in addition to the
design of a database. In terms of the database, we developed a
parameterized database generator that will generate our databases
with attributes according to a specified format and with values
from well-defined domains according to specific distributions. We
chose this approach so that we could predict or interpret
accurately the results of any given query. More details are given
on the context and design of the database in Chapter II.

Query streams are developed to test the full range of possible
user operations. All queries are in forms of selection,
projection, or join operations as may be made by a typical user.
The actual query syntax and selection of query streams is
discussed further in Chapter III.
In addition, the environment available to us for the test runs is
very restricted. There are no hardware or software probes
available at the time of testing, nor any statistical information
on the backend machine. Our only recourse is to use a built-in
retrieve function that will give a readout of the database
machine clock. Unfortunately, the clock has a low resolution,
1/60 of a second. A system call is executed to retrieve the time
before and after each test query, thereby providing a crude yet
consistent time measure.
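
The effect of the 1/60-second resolution on a single measurement
can be illustrated with a minimal Python sketch. It is not the
procedure used on the RDM-1100; it assumes that each clock readout
is a whole number of ticks truncated from the true time, and the
function names and sample readings are hypothetical.

```python
from fractions import Fraction

# Clock resolution stated in the text: 1/60 of a second per tick.
TICK = Fraction(1, 60)


def elapsed_seconds(ticks_before, ticks_after):
    """Measured elapsed time in seconds between two clock readouts."""
    return (ticks_after - ticks_before) * TICK


def elapsed_bounds(ticks_before, ticks_after):
    """Bounds on the true duration, assuming truncated readouts.

    If each readout is truncated to a whole tick (an assumption), the
    true duration lies within one tick of the measured difference.
    """
    measured = elapsed_seconds(ticks_before, ticks_after)
    low = max(Fraction(0), measured - TICK)
    return low, measured + TICK


# Hypothetical readouts taken before and after one test query.
before, after = 1200, 1290
low, high = elapsed_bounds(before, after)
print(float(elapsed_seconds(before, after)), float(low), float(high))
```

Under this assumption a query that runs for only a few ticks
carries a large relative uncertainty, which is one reason to favor
queries whose running time spans many ticks.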

1.  The Host

The actual testing is done using a UNIVAC 1100/42 host system.
The system is located at the Pacific Missile Test Center, Point
Mugu, California. The basic database machine used is the
RDM-1100, which is a Britton-Lee IDM-500 modified to run as a
backend to UNIVAC 1100 computers by the Amperif Corp. of
Chatsworth, California.

The testing is done using run-stream queries in an interactive
environment. These queries are run either on site at Pt. Mugu, or
from a remote terminal set up at the Naval Postgraduate School,
Monterey. We prefer to run the test queries in a stand-alone,
single-user mode in order to minimize the effects of workload
variability of the host machine. In the event that the queries
are not run stand-alone, the number of coincidental users is very
low and little or no difference is observed in the measurement
from one run to another.

2.  The Host Interface

The interface between the UNIVAC and the RDM is via a word
channel; the RDM is treated as an I/O device by the UNIVAC
mainframe. The standard IDM device is capable of communicating
over an RS-232 serial interface or an IEEE-488 parallel
interface. The communication board of the IDM at Pt. Mugu has
been modified to be compatible with the UNIVAC system. It
supports a byte/word channel interface with a 200K byte/second
capacity.

The driver routines on the UNIVAC host handle the parsing of the
user queries, and translate them into the IDM internal format.
The host also handles the communication protocol with the backend
machine. The backend, in addition to performing the necessary
handshakes, will perform the required error checks and cause the
host to retransmit in the event that an error is detected.

3.  Machine Configurations

The IDM-500 system comes with different amounts of internal cache
memory, and has an optional accelerator board. The accelerator is
a high-speed processor designed to perform certain common
relational functions in order to increase the overall system
performance. The machine can be configured to hold 1-6 megabytes
of information. We have run tests on the following
configurations:

(1)  1/2-megabyte  cache   without    accelerator; 

(2)  2-megabyte    cache    with   accelerator; 

(3)  2-megabyte    cache    without    accelerator. 

The first of these configurations is no longer marketed. The
standard package contains 1 megabyte of cache memory and no
accelerator. In addition, the machine used in our tests is linked
exclusively to the UNIVAC 1100, and is equipped with only one
disk controller, with access to two 600-megabyte disks.
C.  THE BENCHMARKED MACHINE

We chose to restrict our work to the IDM-500, a relational
database machine. This type of machine is relatively new on the
database market. Although it is not clear that it will be the
predominant database machine architecture, the latest literature
and current trends appear to indicate that it may play an
important role, at least in the short run.

The relational model is intuitively easier to use and understand
than other database models, and it appears that it will
significantly contribute to lower software development costs.
Nevertheless, fully-implemented software relational database
management systems have severe performance problems. The high
cost of performing relational operations, most strikingly the
join and projection operations, underlies the problem.

With the great interest in the relational database models and the
advances in technology that permit the use of special-purpose
processors and backend systems to perform the majority of work,
we feel that the relational database machine will play an
important role in the database management market. The Britton-Lee
IDM-500 is one of the first machines to take advantage of this
technology and incorporate it into a relational database system
which can be used as a backend to a variety of mainframes.

1.  Modular Design

The Britton-Lee IDM-500 is a backend relational database machine
that can be linked to one or more host computers. Amperif Corp.
markets this system under an OEM agreement as the RDM-1100.
Essentially, the system is a Britton-Lee IDM-500 with Amperif
providing the host and backend interface software to communicate
with the UNIVAC 1100 and a host-interface module. Figure 1.1
depicts the architecture of the Britton-Lee machine. From now on
we will use IDM-500 and RDM-1100 interchangeably.

The backend is a modular, expandable, microprocessor-based system
organized around a central high-speed bus. Each module is
functionally oriented.

2.  Technology and Functionality of Modules

The RDM-1100 is made up of six basic modules organized on a
central high-speed bus (see Figure 1.1 again). The modules
perform the following functions:

a.  The database processor

The database processor, a Z8000-based microprocessor, supervises
and manages all system resources. This processor executes most of
the software in the system.

b.  The database accelerator

The database accelerator (an optional processor) is a high-speed
processor with an instruction set specifically designed to
perform and optimize certain functions. It is activated by the
database processor as appropriate. The accelerator has a
three-stage pipeline which executes instructions at up to 10
MIPS. This processor can initiate disk activity and process data
at disk transfer rates. The accelerator and the RDM software are
so configured that the majority of database work is performed by
the accelerator under the direction of the database processor.

c.  The    main    memory 

The RDM main memory, or cache memory, is composed of 64K-bit
dynamic RAM chips. The RDM can be configured with from 1 to 6
megabytes of memory. This memory is utilized for RDM system code,
disk buffering, indices, and user commands.
d.  The internal bus

The entire system uses a common internal bus system for
inter-processor communication and data transfer.

e.  The disk/tape interfaces

The system can be configured with up to 4 disk controller
modules. Each controller can manage from one to four disk drives.
The disk controller moves data between external disks and the RDM
main memory. The disk controller is designed to work with the
accelerator, which can process data at disk transfer rates. An
optional tape control module supports up to eight tape drives,
which can be used for direct disk-to-tape backup, data loading,
and RDM software loading.

f.      The    host    interface 

The RDM and the host(s) communicate via the host interface
module. This module accepts commands from one or more hosts,
performs error checking, causes the host to retransmit if an
error is detected, and informs the database processor that it is
moving a command into the cache. Each host interface module can
handle up to eight hosts. Hence, with the full 8 interface
modules, a maximum of 64 hosts can be accommodated by the RDM.
The standard interface module supports either an RS-232 serial
interface or an IEEE-488 parallel interface.
II.  THE DATABASE

In our benchmark measures on the RDM-1100, it is important to
model the queries or transactions to be processed, and to model
the database. The performance of any database system depends not
only on the characteristics of the database system, but also on
the size and structure of the database. Considering this
two-dimensional problem, we want to build databases where the
values for each attribute may be selected from well-defined
domains. In addition, we feel that these values should have
specified and well-formed distributions to aid in the prediction
of the response set for any given query.

We have built a parameterized relation generator, a software
system to generate relations for synthetic databases. These
synthetic databases are then used by our query stream to simulate
the activity of actual users on the system. Several of these
databases are built, varying the tuple widths as well as the
number of tuples per relation. We then attempt to distribute the
databases on the disks to force specific actions on the
processor, such as join operations between relations on the same
or separate disks. In this manner we seek to find any significant
difference due to the distribution and location of the data on
disks.

A.       THE    USE    OF    SYNTHETIC    DATA 

As with any system model, it is important that the synthetic data
adequately represent the essential characteristics of real
databases. By utilizing the synthetic database, we can represent
a subset of the real-world database and save the time and space
that accommodating the full set of the real-world database would
require. However, the organization is general enough to provide
an emulation of the real world. The synthetic databases we have
designed include the basic data types that would exist in a
real-world database: integer, character, and so on. For attribute
values we have incorporated both sequential and random orders, as
well as groupings according to specific discrete distributions.
These are more fully described in the next section. Using this
format we can not only accurately predict the outcome, i.e., the
amount of data returned by a query, but we can also easily
reproduce the databases on other systems for further tests.

B.       GENERATION    OF    THE   SYNTHESIZED    DATA 

When designing the database, our first concern is with the
physical sizes that should be used. The relations must be large
enough to test the full capacity of the system, and meaningful
enough to include various attributes. For example, we choose
tuple widths of 100, 200, 1000, and 2000 bytes, with the maximum
tuple width being limited to 2000 bytes and the disk access being
performed in 2K blocks.


Our second consideration is how large the relations should be,
i.e., how many tuples per relation. Again, in order to test the
system for both large and small relations, we decide on relations
with 500, 1000, 2500, 5000, or 10000 tuples. These are arbitrary
decisions. The relation sizes are multiples of the smallest
number in order to facilitate comparisons of the test results.
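
To give a feeling for the data volumes these choices imply, the
raw size of each relation can be tabulated with a short Python
sketch. The tuple widths and relation sizes are those given above.
The block estimate is only a lower bound under stated assumptions:
a 2K block is taken to be 2048 bytes, tuples are assumed not to
span blocks, and any per-block or per-tuple overhead of the
database machine is ignored.

```python
TUPLE_WIDTHS = (100, 200, 1000, 2000)              # bytes, from the text
RELATION_SIZES = (500, 1000, 2500, 5000, 10000)    # tuples, from the text
BLOCK_BYTES = 2048   # assumption: "2K" means 2048 bytes


def raw_bytes(width, tuples):
    """Raw data volume of one relation, ignoring storage overhead."""
    return width * tuples


def min_blocks(width, tuples, block_bytes=BLOCK_BYTES):
    """Lower bound on the number of blocks holding the relation.

    Assumes whole tuples per block and no overhead (both assumptions).
    """
    per_block = block_bytes // width
    return -(-tuples // per_block)   # ceiling division


for width in TUPLE_WIDTHS:
    for tuples in RELATION_SIZES:
        print(width, tuples, raw_bytes(width, tuples), min_blocks(width, tuples))
```

With these assumptions a 2000-byte tuple occupies a block by
itself, so the largest relation needs at least 10000 block
accesses for a full scan.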

Our next consideration is the actual design and building of the
data generation tool. We envision a great many databases with
differing configurations. Thus, an interactive interface to a
generation program appears to be the most effective approach.
Using the locally available IBM 3033 VM/CMS installation and
PASCAL/VS as the language, an interactive system is built. For
more information on the design, programming, and operation of
this tool, please see [Ref. 6].

Using the interactive system, the user is allowed to define the
format of a relation in response to system prompts, on an
attribute-by-attribute basis. The tuple width and relation size
are defined. The user is then allowed to 'add' attributes to the
tuples one after another until he reaches the desired limit.

The user can choose from several methods of attribute value
generation. Integer values can be sequential or random within a
specified domain. Uniqueness of the random integer can be
assured. The integer can be either one, two, or four bytes.
Character strings can also be chosen, either compressed or
uncompressed, in a collating sequence or in some random order.
Character string values can also be selected from enumerated
domains either randomly or according to a specific discrete
distribution. In our prototype the discrete distributions are
limited to multiples of 5%. The user is also given the
opportunity to set the naming convention for each relation and
its attributes. The prototype is designed and implemented with a
limited set of alternatives. It is, however, modular, so that
alternatives such as exponential or normal distributions can be
added to the prototype.
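
The generation methods described above can be sketched in a few
lines of Python. This is not the PASCAL/VS tool of [Ref. 6]; it is
a minimal illustration, and the function names, the seed, and the
sample domain are hypothetical.

```python
import random


def sequential_ints(n, start=1):
    """Sequential integer values: start, start + 1, ..."""
    return list(range(start, start + n))


def unique_random_ints(n, low, high, rng):
    """n distinct random integers drawn from the domain [low, high]."""
    return rng.sample(range(low, high + 1), n)


def enumerated_values(n, shares, rng):
    """Values from an enumerated domain with a discrete distribution.

    shares maps each value to its percentage of the n tuples. As in
    the prototype, percentages must be multiples of 5; they must also
    sum to 100 and yield whole tuple counts.
    """
    if any(pct % 5 for pct in shares.values()) or sum(shares.values()) != 100:
        raise ValueError("percentages must be multiples of 5 summing to 100")
    values = []
    for value, pct in shares.items():
        count, remainder = divmod(n * pct, 100)
        if remainder:
            raise ValueError("n * pct / 100 must be a whole number")
        values.extend([value] * count)
    rng.shuffle(values)   # random placement of the values among tuples
    return values


rng = random.Random(1984)   # fixed seed so a database can be reproduced
key = sequential_ints(1000)
uniqrand = unique_random_ints(1000, 1, 100000, rng)
city = enumerated_values(1000, {"CHICAGO": 25, "MONTEREY": 75}, rng)
print(key[:3], len(set(uniqrand)), city.count("CHICAGO"))
```

Seeding the generator is one simple way to make a synthetic
database reproducible on another system.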

We use a standard template for each tuple width. A portion of
this template is standard for each relation (see Figure 2.1).
Each relation contains: a sequential-integer attribute, 'key', of
4-byte integers; a character attribute, 'mirror', which is
identical in numerical value to 'key' but stored as a character
string and not as an integer; a random-integer attribute, 'rand',
of 4-byte integers; and a character-string attribute, 'chars',
which contains

     100 BYTES          200 BYTES          1000 BYTES         2000 BYTES
  FIELD     TYPE     FIELD     TYPE     FIELD     TYPE     FIELD     TYPE

  KEY       I4       KEY       I4       KEY       I4       KEY       I4
  MIRROR    C11      MIRROR    C11      MIRROR    C11      MIRROR    C11
  RAND      I4       RAND      I4       RAND      I4       RAND      I4
  UNIQRAND  I4       UNIQRAND  I4       CHARS     C63      CHARS     C79
  CHARS     C4       CHARS     C14      P5        C9       P5        C9
  LETTER    C1       LETTER    C1       P10       C9       P10       C9
  P5        C9       P5        C9       P20       C9       P20       C9
  P10       C9       P10       C9       P25       C9       P25       C9
  P20       C9       P20       C9       P30       C9       P30       C9
  P25       C9       P25       C9       P35       C9       P40       C9
  P35       C9       P30       C9       P40       C9       P50       C9
  P50       C9       P35       C9       P45       C9       P60       C9
  P75       C9       P40       C9       P50       C9       P70       C9
  P80       C9       P45       C9       P60       C9       P75       C9
                     P50       C9       P65       C9       P80       C9
                     P55       C9       P70       C9       P90       C9
                     P60       C9       P75       C9       P100      C9
                     P65       C9       P80       C9       UP10      UC255
                     P70       C9       P85       C9       UP20      UC255
                     P75       C9       P90       C9       UP25      UC255
                     P80       C9       P100      C9       UP50      UC255
                     P85       C9       UP10      UC255    UP75      UC255
                     P90       C9       UP25      UC255    UP80      UC255
                     P100      C9       UP50      UC255    UP100     UC255

  FIELD TYPES

  C  - COMPRESSED CHARACTER STRING
       (MAXIMUM OF 255 CHARACTERS)

  UC - UNCOMPRESSED CHARACTER STRING
       (MAXIMUM OF 255 CHARACTERS)

  I4 - FOUR-BYTE INTEGER
       THIS FIELD MAY CONTAIN ANY INTEGER VALUE BETWEEN
       -2,147,483,648 AND +2,147,483,647

Figure 2.1  Tuple Templates.
characters in a collating sequence. The number of characters in
'chars' is dependent on the tuple width, in order to ensure that
tuples are exactly 100, 200, 1000, and 2000 bytes wide. The
length of 'chars' is set to the precise number of characters
required to ensure that the tuple is of the proper width. The
random field is present to aid in randomizing the order of the
tuples, and the purpose of the mirror field is to compare the
performance of identical retrieve operations based on queries
qualified on the sequential-integer attribute, 'key', and the
character attribute, 'mirror'. The 100-byte and 200-byte tuples
also contain a sequential-unit-letter field of 1-byte characters
in collating sequence, 'letter', and a unique random-integer
attribute of 4-byte integers, 'uniqrand'.
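
The way the length of 'chars' pads each tuple to its exact width
can be checked with a small Python calculation. The field widths
and the number of enumerated-value fields per template used below
are read from Figure 2.1 and should be treated as assumptions, as
should the premise that each field occupies its full declared
width; the function and constant names are hypothetical.

```python
# Declared field widths in bytes, as read from Figure 2.1 (assumed).
KEY, MIRROR, RAND = 4, 11, 4       # present in every template
UNIQRAND, LETTER = 4, 1            # only in the 100- and 200-byte templates
P_FIELD = 9                        # compressed enumerated-value field
UP_FIELD = 255                     # uncompressed enumerated-value field


def chars_length(width, p_fields, up_fields=0, small_template=False):
    """Length 'chars' must have for the tuple to be exactly width bytes."""
    used = KEY + MIRROR + RAND
    if small_template:
        used += UNIQRAND + LETTER
    used += p_fields * P_FIELD + up_fields * UP_FIELD
    if used > width:
        raise ValueError("fields already exceed the tuple width")
    return width - used


print(chars_length(100, 8, small_template=True))
print(chars_length(200, 18, small_template=True))
print(chars_length(1000, 17, up_fields=3))
```

With 8, 18, and 17 nine-byte fields respectively (and three
255-byte fields in the 1000-byte template), the remainders agree
with the 'chars' lengths legible in the figure for those three
templates.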

Each template is then filled out with attributes for which the
values are chosen from a number of enumerated values. For
example, the P10 attribute specifies attribute values with a
uniform distribution over ten unique values. A retrieve statement
with one qualifier could then be written to retrieve 10% of the
tuples in the relation. The number of such fields is dependent on
the tuple width.
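
The predictability that such attributes provide can be made
concrete with a short Python sketch. It assumes an exactly uniform
distribution, so that an equality qualifier on one of k distinct
values returns 1/k of the tuples; the function name is
hypothetical.

```python
RELATION_SIZES = (500, 1000, 2500, 5000, 10000)   # tuples, from the text


def expected_tuples(relation_size, distinct_values):
    """Tuples returned by an equality qualifier on a uniform attribute."""
    if relation_size % distinct_values:
        raise ValueError("relation size must divide evenly among the values")
    return relation_size // distinct_values


# P10 as described in the text: ten unique values, 10% of the tuples each.
for size in RELATION_SIZES:
    print(size, expected_tuples(size, 10))
```

Because the expected response set is known in advance, a measured
time can be related directly to the amount of data returned.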

Once the design of the databases is complete, multiple instances
of each relation are built using the interactive generation tool
on the IBM 3033. The relations are then transferred to tape
storage for transport to Pt. Mugu and the UNIVAC 1100. The data
is loaded onto the UNIVAC 1100 disks and then loaded to the
backend database machine using a bulk-load utility.

Tests are planned on the basis of an assumed capability to
control the distribution of the data on the RDM-1100 disks. The
capability to direct a relation to a specific disk is not
implemented, although the space allocation for a database can be
split across multiple disks. The pattern of block allocation for
relations within the database is controlled within the database
machine, and is not predictable.
III.  THE QUERY LANGUAGE

The interaction between the user and the RDM-1100 is through the
software interface, RQL (Relational Query Language), provided by
Amperif. The interface translates the user's RQL command into the
backend machine's internal format and sends the formatted command
to the RDM-1100. The software requirement for the host is
minimal, and the backend machine is independent of the host.

When performing the test runs, the test queries are grouped into
run-streams in order to make more efficient use of the available
time. The time provided for our test runs has been very
restricted. Since we prefer to make our test runs in a
stand-alone, single-user environment to minimize the host
workload variability, we are forced to execute our run-streams
during the evenings and on weekends. In addition, we want to run
sets of tests over several system configurations. This again
reduces the overall time for us to run our performance tests on
each configuration.

Additional constraints are imposed by the nature of the interface
software provided by Amperif and by the configuration of the
machine at Pt. Mugu. Pre-compilation of the queries is not
supported. We therefore have chosen to use the stored-commands
facility of the backend machine to reduce variability in the
parsing time. The stored-commands facility allows the user to
store the parse-trees produced by the interpreter as named
commands in a relation in the user's database. When these stored
commands are invoked at a later time, the parsing is reduced to a
minimum. Using the stored-commands facility also eliminates the
time required to look up target-list and qualification attributes
in the data dictionary.
A.  SYNTAX AND SEMANTICS

The basic operations involved in retrieving data in a relational
system are selection, projection, and join. This section will
provide a basic overview of the syntax of the Relational Query
Language (RQL), with pertinent examples. For a more detailed
explanation of the language as well as the database administrator
functions, please refer to [Ref. 5]. This thesis focuses
exclusively on the selection and projection operations. The
interested reader is encouraged to read [Ref. 4] for an
explanation and evaluation of the join operations as performed on
the RDM-1100 and its various configurations.

Simple selection in RQL is expressed as follows:

RETRIEVE ( A.ALL ) WHERE A.CITY = "CHICAGO"

The keyword for the selection operation is RETRIEVE. The relation
referred to in this case is A and the qualifier ALL indicates that
all attribute values, i.e., the entire tuple, are to be returned for
each qualifying tuple. In this example an optional qualifier
consisting of a single predicate has been added, WHERE A.CITY =
"CHICAGO". This qualifier restricts the tuples returned to only
those tuples in which the city attribute has a value of "CHICAGO".
The qualifier could have multiple predicates, each built with a
comparison operator such as = (EQUAL) or != (NOT EQUAL), and related
by boolean operators such as AND and OR. An example is:

RETRIEVE ( A.ALL ) WHERE A.CITY = "CHICAGO" OR A.CITY = "MONTEREY"

In this case the backend machine will return all the tuples in the
relation A in which the city attribute has either the value
"CHICAGO" or the value "MONTEREY".

The  selection  operation  restricts  the  tuples  to  be 
returned.  The  projection  operation  restricts  the  attribute 
values  of  a  tuple;  only  a  portion  of  the  attribute  values  of 
each    tuple    are    returned.      For    example: 

RETRIEVE ( A.CITY, A.NAME )

In this case, the target list ( A.CITY, A.NAME ) specifies the
attribute values to be projected out of the tuple and returned to
the user. Only the values of attributes CITY and NAME for each of
the tuples in the relation A will be returned. A qualifier (not
shown) could be added as in a previous example to limit the number
of tuples returned to a specific subset of the relation.
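
The semantics of the two operations can be restated outside RQL. The
following Python sketch is only an illustration; the sample tuples
and the helper names select and project are hypothetical and say
nothing about how the backend machine implements the operations.

```python
# Illustrative sketch: selection and projection over an in-memory relation.
# The relation contents and helper names are hypothetical examples.

relation_a = [
    {"NAME": "SMITH", "CITY": "CHICAGO", "KEY": 1},
    {"NAME": "JONES", "CITY": "MONTEREY", "KEY": 2},
    {"NAME": "BROWN", "CITY": "DENVER", "KEY": 3},
]

def select(relation, predicate):
    """Selection: keep whole tuples that satisfy the qualifier."""
    return [t for t in relation if predicate(t)]

def project(relation, target_list):
    """Projection: keep only the listed attributes, in the listed order."""
    return [{name: t[name] for name in target_list} for t in relation]

# RETRIEVE ( A.ALL ) WHERE A.CITY = "CHICAGO" OR A.CITY = "MONTEREY"
selected = select(relation_a, lambda t: t["CITY"] in ("CHICAGO", "MONTEREY"))

# RETRIEVE ( A.CITY, A.NAME )
projected = project(relation_a, ["CITY", "NAME"])
```

Selection changes how many tuples come back; projection changes how
wide each returned tuple is. The two can be composed by applying
project to the result of select.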

Commands like these make up the bulk of the queries used in the
selection and projection tests, with varying qualifiers attached.
RQL has many more capabilities, such as the aggregate functions and
the BY clause. For further details, again refer to [Ref. 5].

B.  TEST QUERIES

The test queries used are all selection and projection operations in
the form of the previous two examples. Qualifications are used on
these queries to select given percentages of the attribute values,
as well as given percentages of the tuples in each relation. As
described in Chapter II, single qualifiers are used on the attribute
values having discrete distributions to select only a given
percentage of each relation. Comparisons are made on the backend
database machine's performance as the percentage of data retrieved
is varied. This variation covers two dimensions: the percentage of
tuples in a relation and the percentage of attribute values in a
tuple. Additional testing is done on single-tuple retrieves and
queries using range predicates on the key field. Each of these
experiments is described in further detail in Chapters IV and V
along with a detailed description of the commands used to retrieve
the data.

1.  Timing Considerations

As mentioned before, the most critical restriction placed on the
performance tests is the lack of measurement tools. There are no
monitors available to keep track of CPU or I/O activities in the
backend database machine. The only available measurement capability
is a measurement of elapsed time that can be extracted from the
backend database machine clock, which has a resolution of 1/60th of
a second. Our prime concern in this performance evaluation is to
determine the effects of varying certain parameters on a backend
database machine and gather some gross overall measures. In this
sense, therefore, we feel that the rough measurements afforded by
the backend machine are still acceptable for our purpose.

In order to determine the elapsed time in processing a query, a
retrieve command to extract the time from the backend database
machine clock is executed before and after each query. The retrieve
command is of the form:

RETRIEVE ( TIME = GETTIME() ) GO

GETTIME is a system function of the backend machine. This command is
used to print a time, in 1/60-second increments, before and after
our queries. Using this throughout our experiments we can get gross,
yet consistent measurements of total time required to execute the
queries. Even with this poor resolution, the comparison of identical
queries will yield relevant performance comparisons of the response
time of the backend machine.
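
The arithmetic behind these measurements is simple enough to sketch.
The Python fragment below assumes the two clock readings are
available as integer tick counts; the function name and the sample
readings are hypothetical.

```python
# Illustrative sketch: converting two clock readings, taken in 1/60-second
# ticks before and after a query, into an elapsed time in seconds.
# The function name and the sample readings are hypothetical.

TICKS_PER_SECOND = 60

def elapsed_seconds(ticks_before, ticks_after):
    """Elapsed time between two readings of the backend clock."""
    return (ticks_after - ticks_before) / TICKS_PER_SECOND

# Each reading is quantized to one tick, so a difference of two readings
# can be off by up to one tick (1/60 s) in either direction.
RESOLUTION_SECONDS = 1 / TICKS_PER_SECOND

sample = elapsed_seconds(1200, 1500)  # 300 ticks
```

For queries that run for seconds or longer, an uncertainty of about
one tick is small relative to the measured time, which is consistent
with treating the readings as gross but usable measures.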

2.  Objectives

The final objective of these tests is not to generate large volumes
of data with figures of retrieval times for particular queries. Our
primary goal is to make relevant comparisons of the machine
performance as the queries are varied inside specific parameters. To
this end we hope to make some judgements of the overall performance
of this particular backend database machine, but more importantly to
gain some insight into the testing methodology for backend database
machines in general. In the next chapters, examples of the
run-streams used in the experiments are given along with graphical
representations of the test results.
IV.  PERFORMANCE EVALUATION OF THE SELECTION OPERATION

A.  DEFINITION OF A SELECTION

Selection is a means for the user to retrieve and examine pertinent
information from a relation. The user may select the entire relation
or he may restrict the information returned to him in two ways. He
may limit the number of tuples returned by adding a qualification to
the selection operation. The qualification will limit the tuples
retrieved to those whose attribute values satisfy the conditions of
the qualification. A qualification consists of predicates, that is,
assertions on the attribute values of the tuple or tuples. Each
predicate uses a comparison operator such as EQUAL or NOT EQUAL, and
multiple predicates may be combined with boolean operators such as
AND and OR. The user may also restrict the attribute values returned
by explicitly listing those attributes which he desires, that is, a
projection of the relation. This is further described in the
following sections of this chapter.

B.  SELECTIONS IN THE QUERY LANGUAGE

In RQL the user is given considerable power of selection through use
of the RETRIEVE command. Using the 100-byte relation described in
Table 2.1 as a format for a relation A, a typical RQL selection
command might be:

RETRIEVE ( A.ALL ) WHERE A.KEY = 25

In this command the keyword RETRIEVE is used to signify selection,
the A.ALL indicates that all attribute values, i.e., entire tuples,
are to be returned, and the keyword WHERE identifies the qualifier.
The A.ALL may be replaced with an explicit listing of those
attributes desired. The attributes may be listed in any order the
user desires. Using the keyword WHERE and a qualification, the user
may then indicate which of the tuples are to be returned. In this
example, only those in which the KEY field is equal to 25 are
returned. The user may use other operators such as < or >, and is
given the option to use more than one predicate. For example:

RETRIEVE ( A.ALL ) WHERE A.KEY > 25 AND A.KEY < 100

would return all tuples with the KEY field in the range 26 through
99. The user is given great latitude in delimiting the subset of the
relation he desires. For more detailed information concerning the
capabilities and syntax, the reader is encouraged to read [Ref. 5].

C.  AN ENVIRONMENT FOR THE MEASUREMENTS

The results discussed in this chapter are from tests performed on
the system configuration with 2-megabyte cache memory and the
optional accelerator. Lack of time prevented a significant number of
tests on alternate configurations for comparison. However, these
tests can be conducted on other configurations without
modifications.

As described in Chapter III, the timing measurements are the backend
system's response to a retrieve for its internal system clock time
in 1/60-second resolution. In most cases the measurements are based
on single queries due to the time involved. Some measurements are
averages over several query responses; these are differentiated in
the sections which follow. In all cases the tests are runs performed
in the evenings and weekends with virtually no other users on the
system.

D.  SELECTION MEASUREMENTS

The figures in the first section represent results gathered for
selections with and without indices. The number of tuples returned
is restricted to a fixed proportion of the total number of tuples in
the relation; no projection is involved. The final sections give
comparisons of the system ordering capabilities on the frontend as
well as the backend, and the effects of data compression.

1.  The Percentage of Selection

Figures 4.1 and 4.2 show the system response time for selection.
Figure 4.1 shows measurements on a database with no indices; Figure
4.2 shows measurements on a database with a non-clustered index on
the P5 and P10 attributes. As described in Chapter II, the P5 and
P10 attributes are attributes whose values are in a uniform
distribution over the corresponding percentage. The P5 attribute
values will be 20 unique values each appearing in 5% of the tuples
and the P10 values are 10 unique values each appearing in 10% of the
tuples. The queries used are qualified on the P5 attribute.
Therefore, for each query the system will return exactly 5% of the
tuples in the relation.
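
How such attributes guarantee an exact selection percentage can be
shown with a small sketch. The Python fragment below is not the data
generator of Chapter II; the value names and the round-robin
assignment are assumptions made only for illustration.

```python
# Illustrative sketch: building P5 and P10 attributes so that each distinct
# value appears in exactly 5% or 10% of the tuples. The value names and the
# round-robin assignment are assumptions, not the generator of Chapter II.

def build_relation(num_tuples):
    # num_tuples is assumed to be a multiple of 20 so the shares are exact.
    return [
        {"KEY": i, "P5": f"V{i % 20}", "P10": f"W{i % 10}"}
        for i in range(num_tuples)
    ]

relation = build_relation(1000)
five_percent = [t for t in relation if t["P5"] == "V7"]
ten_percent = [t for t in relation if t["P10"] == "W3"]
```

With 1000 tuples, an equality predicate on any single P5 value
matches 50 tuples and one on any single P10 value matches 100, so
the size of the result is known before the query is run.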

As evident in Figure 4.1, the system response time increases nearly
linearly as the amount of data returned increases. As expected, the
larger the tuple size, the steeper the slope, since the volume of
the data increases more rapidly for the larger tuple size.
Figure 4.2 shows the results of the same queries run against a
database with indices on the P5 and P10 attributes. Comparing
Figures 4.2 and 4.1, we notice that the overall times are greatly
reduced. The graph still shows a nearly linear relationship between
the increasing response time and the increasing volume of data.
Further discussions of the effects of indices follow in the next
section.

The linearity of the response time appears to indicate that the
system performance is bound by the speed of the channel between the
host and the backend. The larger the volume of data to be returned,
the longer the channel will be active in order to transfer the data.
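
One way to make a "nearly linear" claim checkable is to fit a
straight line to the measured points and inspect the goodness of
fit. The Python sketch below does this with ordinary least squares;
the sample points are made-up values chosen for illustration and are
not measurements from this thesis.

```python
# Illustrative sketch: least-squares line through (data volume, response
# time) points, with the coefficient of determination as a linearity check.
# The sample points are made-up values, not measurements from the thesis.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

kilobytes = [50, 100, 200, 400]      # hypothetical volumes returned
seconds = [1.5, 2.5, 4.5, 8.5]       # hypothetical response times
slope, intercept, r_squared = fit_line(kilobytes, seconds)
```

Under a channel-bound interpretation, the fitted slope would be the
time per unit of data transferred and the intercept a fixed per-query
overhead. That reading is an assumption of this sketch, not a result
established by the measurements.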

2.  Effects of Clustered and Non-Clustered Indices

The RDM-1100 supports two types of indices, clustered and
non-clustered. Creating a clustered index causes the tuples to be
ordered by KEY for storage. A sparse index containing one entry per
block is built. A non-clustered index, on the other hand, contains a
unique entry for each tuple in the relation. No ordering of tuples
within the relation is implied.
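
The structural difference between the two index types can be
sketched in a few lines. The Python fragment below is a conceptual
model only; the block size, key values, and function names are
hypothetical, and the RDM-1100's actual index format is not
described in this thesis.

```python
# Illustrative sketch: a sparse (clustered) index keeps one entry per block
# of sorted tuples; a dense (non-clustered) index keeps one entry per tuple.
# Block size, key values, and function names are hypothetical.
from bisect import bisect_right

BLOCK_SIZE = 4
keys_sorted = list(range(10, 130, 10))           # 12 keys stored in KEY order
blocks = [keys_sorted[i:i + BLOCK_SIZE]
          for i in range(0, len(keys_sorted), BLOCK_SIZE)]

# Sparse index: the first key of each block.
sparse_index = [block[0] for block in blocks]

def find_block(key):
    """Locate the only block that can hold key, using the sparse index."""
    return bisect_right(sparse_index, key) - 1

# Dense index: every key maps to the position of its tuple; the tuples
# themselves may be stored in any order.
heap = [70, 20, 110, 50, 90, 10]
dense_index = {key: position for position, key in enumerate(heap)}
```

The sparse index is small because it relies on the stored order of
the tuples, whereas the dense index grows with the number of tuples
but places no requirement on how they are stored.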

Figure 4.3 shows response times for the retrieval query with no
qualification, but with an ordering specification. The queries are
of the form:

RETRIEVE ( A.ALL ) ORDER BY A.KEY

where A is the relation name and KEY is an attribute in A. In an
ordered retrieve, the tuples are sorted in the backend machine and
then sent to the host for display. Similar queries are run against a
relation with no index, a relation with a non-clustered index on the
KEY attribute, and a relation with a clustered index on the KEY
attribute. The response times are similar throughout the range of
relation sizes. The indices, clustered or non-clustered, provide no
significant improvement for this range of relation sizes. The
expected results would have shown a significant improvement for the
relation with a clustered index. The similarity in response times
may indicate that the RDM sorts the tuples even though they are
already in sorted order due to the use of a clustered index on the
ordering attribute.

Figure 4.4 shows the results of test runs on relations with and
without non-clustered indices on the P5 and P10 attributes. The
graph shows a significant improvement in response times for the
relations with the non-clustered index. Looking at Figure 4.5, the
improvement ratio is made more evident for simply qualified
retrieves when the index is on the attributes used in the predicates
of the qualification. The larger the tuple size, the greater the
improvement. The 200-byte tuple shows a nearly 95% improvement in
the response time. The other tuple sizes show similar improvements.

3.  Effects of Data Compression on Selection Queries

The backend database machine has the capability of storing character
strings in either compressed or uncompressed format. A character
string in compressed format is stored on the disk with no trailing
blanks. The advantage is a savings in disk space. The tradeoff is
the increased CPU time required to compress and uncompress the
strings as data is moved to and from the disk. Figure 4.6 shows the
results of test runs on relations having only uncompressed attribute
values and on relations having only compressed attribute values. In
the initial test runs the relations have both compressed and
uncompressed attributes as specified in Table 2.1, in order to
ensure the correct byte-width of the tuple.
More specifically, Figure 4.6 shows the result of the tests for the
relations of 100-byte tuple size and the 2000-byte tuple size,
respectively. For the 100-byte tuple the storage requirement is
reduced by approximately 50% when all attributes are fully
compressed. In the case of the 2000-byte tuple size, the savings in
storage is approximately 90%.
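
The compression described here, dropping trailing blanks from
fixed-width character values, can be sketched as follows. The field
widths and sample values in this Python fragment are hypothetical and
are not the attributes of Table 2.1.

```python
# Illustrative sketch: storing a fixed-width character attribute without
# its trailing blanks and measuring the storage saved. The field widths
# and sample values are hypothetical.

def compress(value):
    """Drop trailing blanks before the value is written to disk."""
    return value.rstrip(" ")

def uncompress(stored, width):
    """Pad back to the declared width when the value is read."""
    return stored.ljust(width)

def savings_percent(values):
    full = sum(len(v) for v in values)
    packed = sum(len(compress(v)) for v in values)
    return 100.0 * (full - packed) / full

fields = ["CHICAGO".ljust(20), "RED".ljust(10), "MONTEREY".ljust(20)]
saved = savings_percent(fields)
```

The saving depends entirely on how much of each declared width is
padding, which is why wider tuples with long, mostly blank character
attributes can show a much larger reduction than narrow ones.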

The graph shows a major improvement in the response time for
compressed relations. From the steep slope of the line it appears
evident that the greatest impact on system speed is the amount of
data that must pass over the internal bus. The large reduction in
tuple size for the compressed relation shows a clear advantage over
the uncompressed relation. The delay becomes increasingly
significant for relations of larger tuple sizes. A delay factor of
approximately 10 is observable for the larger tuple size and the
10000-tuple relation.

4.  Effects of Ordering and Randomizing the Database Entries

Figure 4.7 shows the results of tests to measure the backend
system's sorting capabilities. The relations used are stored in the
backend; their tuples are ordered on their KEY attributes. The graph
depicts retrieves with and without ordering specifications on the
KEY attribute. There is a slight increase in the response time for
the ordered retrieves, as might be expected. The differential line
depicts the extra time necessary for the ordering, which increases
as the relation size increases.

Figure 4.8 shows the cost of performing the ordering on the backend
versus the host. In this case batch runs on the host are used to
perform the queries. In general, the batch retrieves show a marked
improvement in response time for identical queries over the
run-stream queries used in Figure 4.7. This may be due to the
decreased overhead cost for a batch versus an interactive
environment. Figure 4.8 also shows that for smaller-size relations
the backend performs a more efficient ordering than the host does.
Even for larger relations the sort time of the host and the sort
time of the backend are comparable.

Finally, Figure 4.9 shows the effect of randomizing the order of the
tuples in the relation. Using the random-number attribute to scatter
the tuples in the relation, similar retrieves are performed on the
ordered and randomized relations. In this case there is a
non-clustered index on the KEY attribute for the relations. The
graph shows minor variances in response times between the two,
clearly indicating that the order in which the tuples are stored is
not a significant factor in response time for the ordered retrieves.

E.  CONCLUSIONS

The response times are generally linear, increasing as the amount of
data to be returned increases. The amount of data may be varied
through the number of tuples in a relation or the width of the
tuples.

The creation of indices shows significant improvement in response
times when the retrieve command is qualified on the indexed
attributes. The indices provide marked improvement as the tuple size
increases.

The effects of data compression show some interesting results.
Figure 4.6 has shown a very large improvement for compressed tuples.
This improvement is most likely attributable to the decrease in the
number of disk blocks accessed. In fact, the difference in time is
proportional to the decrease in the number of blocks used for the
tuples.


[Figure: plot of response time (sec); the remaining axis labels and
caption are not recoverable.]

Finally, the ordering test shows that the backend can sort tuples at
least as fast as the host can. Naturally, the major portion of the
time is spent in transferring the data from the disk to either the
host or the backend; but, nevertheless, the backend proves more
efficient for the smaller size of relations.
V.  PERFORMANCE EVALUATION OF PROJECTION OPERATION

A.  DEFINITION OF A PROJECTION

Projection is a means to restrict the amount and to order the
sequence of information returned to the user in a retrieval
operation. More specifically, projection will restrict the attribute
values that will be returned from each tuple selected. Projection
and selection can be combined to limit the range of values returned.
In addition, a user can rearrange the ordering of the attribute
values as the relation is displayed by varying the order of the
attribute names in the target list. This is not to say that the
actual order of the stored relation is altered but that the subset
displayed to the user is ordered according to his specifications.

B.  PROJECTIONS IN THE QUERY LANGUAGE

In RQL the user is given considerable latitude to describe precisely
which attribute values he wants to be returned. Using the 100-byte
relation described in Table 2.1 as a format for a relation A, the
RQL command:

RETRIEVE ( A.KEY, A.MIRROR )

will return to the user only those attribute values in the relation
A whose attribute names are KEY and MIRROR. The user can list as
many attribute names as he desires and place them in any order in
the target list of the RETRIEVE command. In the case where all
attribute values of a relation are to be listed, the user may simply
use A.ALL. All attribute values, i.e., entire tuples, will be
returned in order as they are stored. The user can also add
qualifiers to restrict the number of tuples returned. These
qualifiers need not be on the attributes listed. For example,

RETRIEVE ( A.KEY, A.MIRROR ) WHERE A.P5 = "RED"

will again return to the user only those attribute values in the
relation A whose attribute names are KEY and MIRROR. In addition,
the qualifier will restrict the tuples returned to those whose P5
attribute value is RED. This RETRIEVE command also illustrates the
means to perform a percentage selection. The P5 attribute values are
colors selected from an enumerated set. Each different color value
in the P5 attribute is present in 5% of the tuples in the A
relation. Using these known percentages, the P5 qualification will
select exactly 5% of the tuples in relation A.

C.  AN ENVIRONMENT FOR THE MEASUREMENTS

The projection measurements discussed here are all on the same
system configuration with 2-megabyte cache memory and the optional
accelerator. Lack of time has prevented us from obtaining
measurements on other configurations.

The projection measurements are conducted for four tuple sizes,
i.e., 100-byte, 200-byte, 1000-byte, and 2000-byte, in three
percentages of returns, 25%, 50%, and 75%. These percentages refer
to the number of attribute values in the tuple that are returned.
With the exception of the 100-byte tuple size, these are exact
percentages; in the 100-byte case, the percentages of attributes
returned were 29% and 71%. This is due to the tuples in the 100-byte
relation having 14 attributes. A strict percentage of 25% and 75%
was not attainable. Nevertheless, they are still referred to as 25%
and 75% projections. Further, the retrieval commands are qualified
by 5% and 10% selections in order to reduce further the amount of
data to be returned. Each query is executed 10 times, each time with
a different qualification. This is done to eliminate any effects due
to the location of the data in the relation and provides a better
average response time.
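
The 29% and 71% figures follow directly from the attribute count.
The short Python check below assumes that 4 and 10 of the 14
attributes were returned; those counts are inferred from the quoted
percentages and are not stated in the text.

```python
# Illustrative sketch: with 14 attributes per tuple, 25% and 75% cannot be
# hit exactly; 4 and 10 attributes give the 29% and 71% quoted in the text.
# The attribute counts 4 and 10 are inferred from those percentages.

NUM_ATTRIBUTES = 14

def percent_returned(count, total=NUM_ATTRIBUTES):
    return round(100 * count / total)

# No whole number of attributes out of 14 yields exactly 25% or 75%.
exact_25 = any(4 * k == NUM_ATTRIBUTES for k in range(NUM_ATTRIBUTES + 1))
exact_75 = any(4 * k == 3 * NUM_ATTRIBUTES for k in range(NUM_ATTRIBUTES + 1))

low = percent_returned(4)    # 4 of 14 attributes
high = percent_returned(10)  # 10 of 14 attributes
```

The 50% projection, by contrast, is exact for this relation, since 7
of 14 attributes is precisely one half.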

D.  PROJECTION MEASUREMENTS

The test queries used are qualified on the P5 and P10 fields of the
relation to perform the aforementioned selection. Each query is then
repeated 10 times with a different qualifier. The figures represent
the average response time for those ten tests. Each graph shows the
response time in seconds plotted against the number of tuples in the
relation.
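
The averaging step can be written down compactly. In the Python
sketch below, run_query is a hypothetical stand-in for submitting
one qualified query and reading the elapsed clock ticks; the timings
are made-up values so the fragment runs without a backend machine.

```python
# Illustrative sketch: averaging the response time of one query shape over
# ten runs, each qualified on a different P5 value. run_query is a
# hypothetical stand-in that returns elapsed clock ticks (1/60 s each).

def average_response_seconds(run_query, qualifier_values, ticks_per_second=60):
    ticks = [run_query(value) for value in qualifier_values]
    return sum(ticks) / len(ticks) / ticks_per_second

# Made-up timings so the sketch runs without a backend machine.
fake_ticks = {f"V{i}": 120 + i for i in range(10)}
average = average_response_seconds(fake_ticks.__getitem__, list(fake_ticks))
```

Averaging over different qualifier values, rather than repeating one
value, spreads the runs across different parts of the relation, which
matches the stated aim of removing effects due to data location.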

1.  Percentage of Projections on Non-Key Attributes

In general the difference in response times for the five-percent and
ten-percent selections is negligible; this is particularly true for
the smaller-size relations. Doubling the number of tuples returned
in a query can result in approximately a 20% increase (i.e., 1/3
second increase in the response time on the average) in the smaller
tuples and a 10% increase (i.e., 7 seconds on the average) in the
larger tuples. Figures 5.1 and 5.2 show the results of a 25%
projection over varying tuple widths, with Figure 5.1 for a 5%
selection and Figure 5.2 for a 10% selection. As can be seen, the
graphs in these two figures are nearly identical. This is also the
case for the graphs on the 50% and 75% projections. For example, in
Figures 5.3 and 5.4, similar graphs for the 5% selection with 50%
and 75% projections are displayed respectively. In each graph of the
aforementioned figures response times increase almost linearly as
the relation size increases, and increase dramatically as the number
of attribute values returned increases.

Figures 5.5 and 5.6 give a different perspective on the same data.
In this case the time for differing projection sizes is graphed over
a constant tuple width. As expected, the greater the number of
attribute values returned, the larger the response time. Again a
much steeper slope is evident in Figure 5.6 for the bigger-width
tuple.

2.  Comparison of the Equivalent Queries on Selection

Figures 5.7, 5.8, 5.9, and 5.10 show the differences in the response
time as the number of attribute values returned per query is varied.
In each graph, the tuple size remains constant. In addition to the
varied projection percentages, a fourth line is added representing a
selection in which all attribute values in each tuple (i.e., the
entire tuple) are returned. The test queries used for the line
marked 'full select' use the ALL specification to return all
attribute values in each tuple. As in the projection measures, each
such query is repeated 10 times. The 5% selections are done on the
P5 field and a different value is used in the qualifier for each of
the 10 queries.

As would be expected, each figure shows a marked difference in the
response time as the number of attribute values returned is
increased. The smaller-width tuples in Figures 5.7 and 5.8 show a
nearly linear increase in the response time as the relation size
(the number of tuples of the same tuple width) increases, and an
increase in the slope of the line as the projection size increases.
In Figure 5.7 the response time for full select is strictly smaller
than any projection time, which indicates that for the smaller
tuples the backend does a strict selection prior to extracting the
attribute values specified in the projection qualifier. As the tuple
width increases, the full select may take more time than that of the
projection. For the 200-byte tuple in Figure 5.8, the full select
time is again nearly linear, and the times are slightly more than
the times for a 25% projection. The difference in response between
the full select and the 25% projection steadily increases as the
relation size increases, but even so the full select is faster than
the 50% and 75% projections.

For much-bigger-width tuples, Figures 5.9 and 5.10 show that the
full select time is higher than the projection time for the small
percentage projections. The full select, however, has a much smaller
slope, thereby crossing the line of the projection time and
eventually showing a trend of quicker response as the relation size
increases. Also of particular note is the uniformity of the curves
for the varying projections in the 1000-byte and 2000-byte tuples in
Figures 5.9 and 5.10. In contrast, for the smaller tuples the lines
are nearly linear with increasing slopes. The lines for the larger
tuples are not linear and the slopes are very even.
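
The crossing behaviour described for the larger tuples can be
illustrated with a simplified straight-line model. The Python sketch
below uses made-up intercepts and slopes; since the measured curves
for the larger tuples are noted above as not being linear, it shows
only the idea of a crossover, not a fit to the data.

```python
# Illustrative sketch: relation size at which two straight-line response
# models cross. The intercepts and slopes are made-up numbers; the thesis
# notes that the measured curves for large tuples are not actually linear.

def crossover_size(intercept_a, slope_a, intercept_b, slope_b):
    """Return x where a + slope_a*x equals b + slope_b*x, or None."""
    if slope_a == slope_b:
        return None  # parallel lines never cross
    return (intercept_b - intercept_a) / (slope_a - slope_b)

# Full select: higher starting cost, smaller slope (hypothetical values).
# 25% projection: lower starting cost, larger slope (hypothetical values).
size = crossover_size(intercept_a=20.0, slope_a=0.002,
                      intercept_b=5.0, slope_b=0.005)
```

Below the crossover size the line with the lower starting cost is
faster; above it the line with the smaller slope is faster, which is
the pattern reported for the full select against the small
projections.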

E.  CONCLUSIONS

In general, the projection results are very predictable in that the
response time is nearly linear and the response time increases as
the amount of data returned increases. The amount of data may be
determined by either the relation size or the projection size.

The full select comparisons in Figures 5.7, 5.8, 5.9, and 5.10, on
the other hand, show some unanticipated results. Instead of showing
a clear advantage in the response time for full select in all
relation sizes, as might be expected, the results vary with the
tuple widths. In the smaller tuple width as depicted in Figure 5.7,
the full select appears to run faster even though the amount of data
returned is greater. For the 200-byte tuples as depicted in Figure
5.8, the relationship is markedly different. For the larger tuples
as graphed in Figures 5.9 and 5.10, the full select requires more
time for the smaller relations. Nevertheless, its advantage becomes
evident as the relation size increases. In summary, the full-select
operation is sensitive to the width of the tuples. In other words,
the greater the tuple width, the higher the select time. The
full-select operation is also sensitive to the size of the
relations, although in an opposite way. That is, the larger the
relation, the smaller the select time in proportion to the
projection time.

It is difficult to determine what effect the cache and
accelerator with other configurations may have on these
tests. A need exists for more research in this area to
verify the figures and collect more data over a wider range
of tuple widths and relation sizes, in hopes of obtaining a
clearer trend in the relationship of the full select to the
projections as the widths and sizes vary.




VI.    CONCLUDING    REMARKS

A.       OVERALL    OBSERVATIONS    OF    THE    MACHINE    PERFORMANCE 

The experiments described in Chapters IV and V show some
predictable results as well as some surprises. Generally, the
simple select operations, with or without indices, display
expected trends. The response time increases as the amount of
data to be returned to the host increases, as shown in Figures
4.1 and 4.5. A similar trend is seen for relations with
compressed attribute values. As Figure 4.6 illustrates, the
reduction in response time can be significant for the large
tuple widths where the degree of compression is high. The
relations with indices also show expected improvements in
response time for retrieves qualified on the indexed
attribute values.

Some unexpected results, however, are seen in the tests
dealing with ordered retrieves (Figure 4.8). The backend shows
an unexpected superiority in sorting over the host for
smaller-size relations. Even for the large relations, up to
10000 tuples, the backend maintains a response time comparable
with the host. One would expect that the mainframe would have
a significant advantage in computing power and show a major
improvement when the relation is ordered in the host instead
of in the backend.

Another interesting result is the effect of clustered and
non-clustered indices on ordered retrieves. Creating a
clustered index on a relation causes the tuples to be stored
in a specific order, while a non-clustered index does not
imply any ordering of the tuples. Figure 4.3 shows very
similar response times throughout the range of relation sizes,
regardless of whether the index is clustered or non-clustered.
This implies that the retrieved tuples are sorted even when a
clustered index exists for the qualifier attributes.

The tests concerning projection of tuple attributes in
Chapter V again show predictable results. Through all the
figures for differing projection percentages and tuple widths,
the graphs display near linearity in both dimensions. The
response time increases as the tuple width or the number of
tuples returned increases. But surprising results are evident
when comparing projection to full selection.

Consider Figures 5.7, 5.8, 5.9, and 5.10 again. As explained
in Chapter V, the overlay of the full select on the varying
projection sizes shows no definite trend. The projection
measurements are consistent throughout the figures, yet the
full select's relationship to the projections varies from one
figure to the next. Two of the four figures indicate that it
is cheaper to retrieve entire tuples than to project attribute
values from the tuple. One figure indicates that beyond a
fixed relation size, it is cheaper to retrieve entire tuples.
The fourth figure seems to indicate that some degree of
projection is always cheaper than retrieving the entire tuple.
No clear conclusion can be drawn. More tests over a wider
range of tuple widths are required to identify an overall
trend or relationship between projection percentage and the
full selection retrieves.

B.       DATABASE    AND    MACHINE    LIMITATIONS

When considering the test environment, two specific
limitations stand above all else. The first of these is the
low resolution of the clock from which measurements are taken.
The standardized use of this clock throughout the tests has
made comparison of various test results over differing periods
meaningful. Even so, the low resolution makes it necessary to
average times over many similar test runs. This greatly limits
the amount of time that one can spend in running more
meaningful tests and in verifying previous results. A great
effort has been made to find some other timing mechanism. In
the end, GETTIME proves to be the easiest to use, the most
consistent, and, most importantly, the easiest to control.
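
The averaging over many similar test runs can be sketched as
follows. This is a minimal illustration in Python, not the
procedure used in these tests: the function name
average_response_time and its parameters are hypothetical, and
Python's time.time merely stands in for a coarse clock; it is
not the GETTIME facility.

```python
import time


def average_response_time(run_query, repeats, clock=time.time):
    """Mean elapsed time of run_query over `repeats` separate runs.

    Each run is timed individually with `clock`, and the mean of the
    individual readings is returned. With a low-resolution clock, one
    reading may be off by up to a clock tick, so the mean over many
    runs is a steadier estimate than any single reading.
    """
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    elapsed = []
    for _ in range(repeats):
        start = clock()
        run_query()
        elapsed.append(clock() - start)
    return sum(elapsed) / len(elapsed)
```

The cost is the one noted above: every additional repetition
consumes machine time that could otherwise go to new tests.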

The second limitation concerns the system configuration and
the inability to control the environment of both the host and
the backend. The performance of these tests has not been a
very high priority of the parent command at Pt. Mugu. This is
to be expected, since the host machine is in a production
environment. Gaining exclusive use is very difficult and
extremely costly. With this restriction, our tests are limited
to weekend and evening runs, at times of relatively low
activity. This significantly reduces the time of system
availability. Also, in terms of the environment, the backend
system we used is a relatively new piece of equipment. Lastly,
the system configuration has been changing frequently during
the experimentation period. The time for which each
configuration is available has been short. Consequently, not
enough data can be collected to make any significant
comparisons.

C.       RECOMMENDATIONS    FOR    FUTURE    BENCHMARKING    EFFORTS 

In light of the test results discussed here, the direction of
future work should be toward the effects of various indices
and ordering capabilities. The tests on various types of
indices and on the ordering of relations show the most
startling results. In addition, some work is required over a
wider range of tuple widths to refine previous results.

Another aspect that warrants research is the simulation of a
more realistic system load, specifically tests with multiple
users of the backend and a more realistic host workload. The
tests in this thesis are run on an unloaded system. In
actuality, the use of the system will most likely occur closer
to peak loading. Different trends may develop when the host
and/or backend are subjected to different load conditions.

Even though these tests are on a specific system, they are
general enough in nature to provide insight for tests on other
relational machines and to aid in making a comparison of
different backends.